An automatic sequence search and analysis protocol (DomainFinder) based on PSI-BLAST and IMPALA, and using conservative thresholds, has been developed for reliably integrating gene sequences from GenBank into their respective structural families within the CATH domain database (http://www.biochem.ucl.ac.uk/bsm/cath_new). DomainFinder assigns a new gene sequence to a CATH homologous superfamily provided that PSI-BLAST identifies a clear relationship to at least one other Protein Data Bank sequence within that superfamily. This has resulted in an expansion of the CATH protein family database (CATH-PFDB v1.6) from 19,563 domain structures to 176,597 domain sequences. A further 50,000 putative homologous relationships can be identified using less stringent cut-offs and these relationships are maintained within neighbour tables in the CATH Oracle database, pending further evidence of their suggested evolutionary relationship. Analysis of the CATH-PFDB has shown that only 15% of the sequence families are close enough to a known structure for reliable homology modeling. IMPALA/PSI-BLAST profiles have been generated for each of the sequence families in the expanded CATH-PFDB and a web server has been provided so that new sequences may be scanned against the profile library and be assigned to a structure and homologous superfamily.
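The two-tier assignment rule described above can be sketched roughly as follows. This is a minimal illustration, not the DomainFinder implementation. The abstract gives no numeric cut-offs, so `STRICT_EVALUE` and `RELAXED_EVALUE` are hypothetical values, and the `Hit` record and `classify` function are names invented for this sketch. It assumes the PSI-BLAST hits against PDB-derived domain sequences have already been parsed into query, superfamily and E-value.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical cut-offs for illustration only; the abstract does not
# state the thresholds DomainFinder actually uses.
STRICT_EVALUE = 1e-4   # "conservative" tier: confident assignment
RELAXED_EVALUE = 1e-2  # "less stringent" tier: putative neighbour


@dataclass(frozen=True)
class Hit:
    """One PSI-BLAST match between a gene sequence and a PDB domain sequence."""
    query_id: str
    superfamily: str  # CATH homologous superfamily of the matched domain
    evalue: float


def classify(hits: List[Hit]) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Split queries into confident assignments and putative neighbours."""
    # Keep only the lowest E-value hit for each query sequence.
    best: Dict[str, Hit] = {}
    for hit in hits:
        current = best.get(hit.query_id)
        if current is None or hit.evalue < current.evalue:
            best[hit.query_id] = hit

    assigned: Dict[str, str] = {}
    neighbours: Dict[str, str] = {}
    for query_id, hit in best.items():
        if hit.evalue <= STRICT_EVALUE:
            # Clear relationship to at least one PDB sequence in the superfamily.
            assigned[query_id] = hit.superfamily
        elif hit.evalue <= RELAXED_EVALUE:
            # Weaker match: held back pending further evidence.
            neighbours[query_id] = hit.superfamily
    return assigned, neighbours
```

Keeping the weaker matches in a separate mapping mirrors the neighbour tables mentioned in the abstract: they stay available for later review without being merged into the superfamily assignments.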